Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from datetime import datetime
from decimal import Decimal
Template
spark = (
SparkSession.builder
.master("local")
.appName("Section 2.3 - Creating New Columns")
.config("spark.some.config.option", "some-value")
.getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
|   | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None |
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None |
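Note that with only header=True, spark.read.csv() reads every column as a string, which is why age shows up as a string here rather than a number. A minimal sketch of passing an explicit schema instead, reusing the T module imported above (the type choices are assumptions based on the sample rows):

schema = T.StructType([
    T.StructField('id', T.IntegerType(), True),
    T.StructField('breed_id', T.IntegerType(), True),
    T.StructField('nickname', T.StringType(), True),
    T.StructField('birthday', T.TimestampType(), True),
    T.StructField('age', T.IntegerType(), True),
    T.StructField('color', T.StringType(), True),
])

# same read as above, but with typed columns instead of all strings
pets_typed = spark.read.csv(path, header=True, schema=schema)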
Creating New Columns and Transforming Data
When we are data wrangling and transforming data, we will usually assign the result to a new column. We will explore the withColumn() function and other transformation functions to achieve this. We will also look at how to rename a column with withColumnRenamed(); this is useful when you want to join two DataFrames on a column of the same name, among other things.
Case 1: New Columns - withColumn()
(
pets
.withColumn('nickname_copy', F.col('nickname'))
.withColumn('nickname_capitalized', F.upper(F.col('nickname')))
.toPandas()
)
|   | id | breed_id | nickname | birthday | age | color | nickname_copy | nickname_capitalized |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | King | KING |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | Argus | ARGUS |
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | Chewie | CHEWIE |
What Happened?
We duplicated the nickname column as nickname_copy using the withColumn() function. We also created a new column where the nickname is converted to uppercase, by chaining multiple Spark functions together.
We will look into more advanced column creation in the next section, where we will go into more detail about what a column expression is and what the purpose of F.col() is.
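As a small preview, here is a sketch of a couple more withColumn() patterns built on the same pets data; F.lit() (a literal/constant column) and cast() are standard pyspark.sql tools, and the new column names are made up for illustration:

(
    pets
    # constant-valued column via a literal expression
    .withColumn('species', F.lit('dog'))
    # cast the string age column to an integer, then do arithmetic on it
    .withColumn('age_next_year', F.col('age').cast(T.IntegerType()) + 1)
    .toPandas()
)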
Case 2: Renaming Columns - withColumnRenamed()
(
pets
.withColumnRenamed('id', 'pet_id')
.toPandas()
)
|   | pet_id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None |
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None |
What Happened?
We renamed the id column to pet_id, replacing the original column.
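To tie this back to the join use case mentioned earlier: a minimal sketch, assuming a hypothetical breeds lookup DataFrame whose id column refers to a breed rather than a pet:

# hypothetical lookup table; its id means breed id, not pet id
breeds = spark.createDataFrame(
    [(1, 'Corgi'), (3, 'Border Collie')],
    ['id', 'breed_name'],
)

(
    pets
    # rename breeds.id to match pets.breed_id so both sides share the key name
    .join(breeds.withColumnRenamed('id', 'breed_id'), on='breed_id', how='left')
    .toPandas()
)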
Summary
- We learned how to create new columns from old ones by chaining Spark functions and using withColumn().
- We learned how to rename columns using withColumnRenamed().